Introduction

DNA methylation

DNA methylation (DNAme) is a process by which methyl groups are added to the DNA molecule. Methylation could change the activity of the DNA without changing it’s code. It serves as a critical phenomenon to several biological functions and has a role in certain disease states. Further, DNAme is a privileged candidate for epigenetic inheritance due to its plasticity and stability across mitosis and meiosis. It is one of the major mechanisms of epigenetic modification and has a fundamental influence on gene expression, genomic stability and cellular activity [ ].

With the advancement of next-generation (NGS) sequencing techniques, there are several methods for studying DNA methylation, but, few offer a better resolution of methylation status such as bisulfite sequencing (also known as Bisulfite-Seq, BS-Seq, Methyl-Seq or WGBS). The key idea of the method is combining the power of high-throughput DNA sequencing with the treatment of DNA with sodium bisulfite. The human methylome contains around 28 million CG sites in humans. There are also non-CG methylated sites: CHG or CHH, where H represents adenine, thymine, or cytosine.

Whole Genome Bisulphite Sequencing

Whole Genome Bisulphite Sequencing (WGBS) is a gold standard NGS technique for studying DNAme in the genome of mammals. WGBS combines sodium bisulfite conversion of the DNA sequence with high throughput DNA sequencing. The sodium bisulfite reaction converts the unmethylated cytosines (C) to uracils (U), whereas, methylated C remains unchanged. These U are further converted to thymine (T) after the polymerase chain reaction (PCR). Although, no change occurs for methylated C. Methylcytosines are the methylated versions of the cytosine bases. This is done by transferring a methyl group onto the C5 position of the cytosine.

The aim of NGS-based DNA methylation analysis is to investigate genomic DNA and find out whether single cytosines or entire regions in the genome are methylated or not. WGBS is the most comprehensive sequencing for DNA methylation profiling, allowing single-base resolution of 5-mC within the whole genome. By comparing treated and untreated sequences, the location of the methylated cytosines is determined.

There are several applications of WGBS, including (not limited to): - detection of differentially methylated regions (DMRs) - detection of differentially methylated loci (DMLs) - detection of copy number variations (CNVs) - detection of single nucleotide polymorphisms (SNPs) - detection of cytosine methylation levels of transcription factor binding sites (TFBSs) - detecting methylation level or distribution (globally, single nucleotide, chromosome-wide and gene-centric)

Sequencing the whole genome is generally quite expensive. Therefore, although it has been applied to large genomes such as the human genome, large numbers of individual samples are seldom sequenced. Reduced representation bisulfite sequencing (RRBS) has been developed for this reason, in which the bisulfite reaction occurs but the sequencing is limited to around 1% of the genome. This enables the sequencing of the genome of several individuals.

Basic pipeline for WGBS data analysis

library(DiagrammeR)
grViz("
digraph flowchart {
graph [overlap = true, layout=dot,fontsize=12]

node [shape = box, style=filled, fillcolor=NavajoWhite, color=DarkslateGray, fontsize=12];
A; B; E; F; G;
node [shape = egg, style=filled, fillcolor=NavajoWhite, color=DarkslateGray, fontsize=12];
C; D;

A [label='Quality Check: FastQC']
B [label='Quality Control: trim_galore']
C [label='Alignment: Bismark']
D [label='Methylation extraction: Bismark']
E [label='Overview of QC: MultiQC']
F [label='Differential analysis: DMLs']
G [label='Differential analysis: DMRs']
edge[color=DarkslateGray,
  penwidth=5
]
A->B
B->C[label = '  trim adapters, remove shorter reads, trim Ns']
C->D
C->E
A->E
B->E
D->F[label = '  DMLs: single basepair resolution']
D->G[label = '  DMRs: identify regions']
}
")

Figure 1. Basic pipeline for WGBS data analysis.



Limitations of current methods

  • Methods are greatly hindered by low sample size. Most of the methods depend on large sample sizes. The number of tests performed = the number of loci analyzed. Example: Human genome has ~30 million CpGs [1].
  • Measurements are spatially correlated across the genome [2], but in most methods measurements from all loci are treated independently.
  • Multiple testing corrections without considering the spatial correlation can result in a loss of power.
  • DMRs more biologically relevant than DMLs.
  • DML approaches may construct DMRs by chaining together neighboring significant loci, but this type of approach will not yield a proper assessment of the statistical significance of the constructed regions, nor will the FDR be properly controlled [4]. Controlling the FDR at the level of individual loci is not the same as controlling FDR of regions.
  • Methylation data cannot be modeled either by Gaussian models (due to low coverage) or by Binomial models (do not account for biological variability). Beta-Binomial models are computationally difficult.

Challenges for assessing DMRs

  • defining region boundaries
  • methods ignore correlation across loci
  • biological variability from sample to sample
  • should be powerful even in the case of 2 samples per group.

Methods

Data

Two data sets were used.

One, representing negative control (with no expected DMRs), contained six methylation profiling samples from normal human dendritic cells (data in GEO). The samples were artificially divided into two groups of three samples, group 1 consisting of samples GSM1565940, GSM1565944, and GSM1565948, group 2 consisting of samples GSM1565942, GSM1565946, and GSM1565950. To obtain reasonable running times, only methylation data from chromosome 18 were used.

The second set of data was created from the negative control by adding 100 simulated DMRs using simDMRS function from R package dmrseq.

The data are the same as were used in the dmrseq paper

The reported DMRs were (if possible) filtered for only those that contain at least 10 CpGs and the mean difference between the two groups is at least 0.1 (the simulated regions had effect sizes between 0.163 and 0.450).

Data download and filtering

The methods used for comparison

In total 7 different methods were used for the comparison. Their main features, as well as links to the analysis files are summarized in the Table 1:

Method Preprocessing + statistical test used p-value estimate given CpG / non CpG Implementation language DOI Availability Analysis file
bsseq local-likelihood smoothing, t-test similar statistics no CpG only R 10.1186/gb-2012-13-10-r83 Bioconductor package bsseq Report
CGmapTools unpaired t-test yes CpG only C, C++, Python 10.1093/bioinformatics/btx595 binary package Report
defiant Weighted Welch Expansion yes CpG only Shell 10.1186/s12859-018-2037-1 binary package Report
DMRcaller no preprocessing, or pooling in bins, or kernel smoothing; Fisher’s exact or Score test yes all contexts R 10.1093/nar/gky602 Bioconductor package DMRcaller Report
dmrseq smoothing; generalized least squares yes CpG only R 10.1093/biostatistics/kxy007 Bioconductor package dmrseq Report
methPipe Fisher’s exact test no CpG only C, C++ 10.1371/journal.pone.0081148 binary package Report
methylKit Fisher’s exact test, or logistic regression with tiling yes all contexts R 10.1186/gb-2012-13-10-r87 Bioconductor package methylKit Report
Table 1. Metadata of the methods.



Results and discussion

Diagnostic plots

Distribution of the coverage and beta values of all the samples, shown in Figure 2A and Figure 2B and Figure 3A and Figure 3B, respectively, depicts that there is not much difference between individual samples. Also, the PCA plot, Figure 4A and Figure 4B, shows that there is no clustering of the two artificially created groups.

Coverage

Negative Control

Figure 2A. Coverage plot of negative control data.

Simulated data

Figure 2B. Coverage plot of simulated data.



Beta values

Negative Control

Figure 3A. Beta values histogram of negative control data.



Simulated data

Figure 3B. Beta values histogram of simulated data.



PCA plots

Negative Control

Figure 4A. PCA plot of negative control data.

Simulated data

Figure 4B. PCA plot of simulated data.

More diagnostic plots can be found here.

DMRs identified by the methods

There was no single region that would all methods overlap by at least 1bp.

Figure 5 shows one simulated DMR, generated as described in the Data section. This region has the highest delta (methylation difference) value in the list of all generated DMRs (0.447). This region was identified as DMR by 6 methods (not identified by methpipe; also not identified by DMRcaller bins method).

Figure 5. DMR with results from all methods (regions passing FDR of 0.05).



Similar plots for all 100 simulated DMRS are shown in this report.

Comparison of different methods

We considered any overlap between the simulated and identified DMRs to be sufficient for the region to be classified as true positive.Details of computing true positives and false positives are described in this report.

Negative Control

As per negative control, we do not expect any true DMRs, all the identified regions are thus false positives. dmrseq and defiant methods did not find any DMRs in the negative control data, whereas others did, as shown in Figure 6 and Table 2.

Barplot

Figure 6. Barplot of no. of regions identified in negative control data.



Table

Table 2. No. of regions and their total length, identified in negative control data.



Simulated Data

FDR vs Power

Figure 7 shows that dmrseq performs better than the other methods. On x axis is the observed FDR, on y axis the power, both for different specified FDR cut-offs. The computation of FDR and power is described here.

Figure 7. FDR vs Power plot.



FDR: Specified vs Observed

We have also checked how well the methods control FDR, Figure 8. For this, we plotted the observed FDR (the proportion of false discoveries) and the specified FDR (cutoff after multiple testing correction by BH). We can see that dmrseq again outperforms the other methods.

Figure 8. FDR: Specified vs Observed.



BSseq and methpipe do not allow user control of FDR, and are thus not depicted in the Figureure.

More comparison plots as well as the code can be found here.

Conclusion

In this project, we wanted to compare the methods for identification of DMRs, which were not compared in the dmrseq paper. We used nearly all methods available, except the ones which do not have a proper statistical description or have not been updated for a long time. We found that dmrseq outperforms all the methods that we compared. This is especially important when we take into acccount that we used just 3 samples per group, which is the most one can hope for in the real experiments (due to the high cost of WGBS). But, it has to be mentioned that the data we used were the same as in the original dmrseq paper and that the simulated DMRs were created using a function which is a part of the dmrseq method. It would be nice and useful to repeat the experiment with some independently generated data.

All the methods were well documented and were easy to use for a non-experienced user, except for CGmapTools, where the documentation was not clear.

SessionInfo

devtools::session_info()
## ─ Session info ──────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 3.5.1 (2018-07-02)
##  os       Ubuntu 16.04.5 LTS          
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language (EN)                        
##  collate  de_DE.UTF-8                 
##  ctype    de_DE.UTF-8                 
##  tz       Europe/Zurich               
##  date     2019-01-11                  
## 
## ─ Packages ──────────────────────────────────────────────────────────────
##  package      * version   date       lib source        
##  assertthat     0.2.0     2017-04-11 [1] CRAN (R 3.5.1)
##  backports      1.1.3     2018-12-14 [1] CRAN (R 3.5.1)
##  bindr          0.1.1     2018-03-13 [1] CRAN (R 3.5.1)
##  bindrcpp       0.2.2     2018-03-29 [1] CRAN (R 3.5.1)
##  brew           1.0-6     2011-04-13 [1] CRAN (R 3.5.1)
##  callr          3.1.1     2018-12-21 [1] CRAN (R 3.5.1)
##  cli            1.0.1     2018-09-25 [1] CRAN (R 3.5.1)
##  colorspace     1.3-2     2016-12-14 [1] CRAN (R 3.5.1)
##  crayon         1.3.4     2017-09-16 [1] CRAN (R 3.5.1)
##  desc           1.2.0     2018-05-01 [1] CRAN (R 3.5.1)
##  devtools       2.0.1     2018-10-26 [1] CRAN (R 3.5.1)
##  DiagrammeR   * 1.0.0     2018-03-01 [1] CRAN (R 3.5.1)
##  digest         0.6.18    2018-10-10 [1] CRAN (R 3.5.1)
##  downloader     0.4       2015-07-09 [1] CRAN (R 3.5.1)
##  dplyr          0.7.8     2018-11-10 [1] CRAN (R 3.5.1)
##  evaluate       0.12      2018-10-09 [1] CRAN (R 3.5.1)
##  fs             1.2.6     2018-08-23 [1] CRAN (R 3.5.1)
##  ggplot2        3.1.0     2018-10-25 [1] CRAN (R 3.5.1)
##  glue           1.3.0     2018-07-17 [1] CRAN (R 3.5.1)
##  gridExtra      2.3       2017-09-09 [1] CRAN (R 3.5.1)
##  gtable         0.2.0     2016-02-26 [1] CRAN (R 3.5.1)
##  hms            0.4.2     2018-03-10 [1] CRAN (R 3.5.1)
##  htmltools      0.3.6     2017-04-28 [1] CRAN (R 3.5.1)
##  htmlwidgets    1.3       2018-09-30 [1] CRAN (R 3.5.1)
##  igraph         1.2.2     2018-07-27 [1] CRAN (R 3.5.1)
##  influenceR     0.1.0     2015-09-03 [1] CRAN (R 3.5.1)
##  jsonlite       1.6       2018-12-07 [1] CRAN (R 3.5.1)
##  knitr          1.21      2018-12-10 [1] CRAN (R 3.5.1)
##  lazyeval       0.2.1     2017-10-29 [1] CRAN (R 3.5.1)
##  magrittr       1.5       2014-11-22 [1] CRAN (R 3.5.1)
##  memoise        1.1.0     2017-04-21 [1] CRAN (R 3.5.1)
##  munsell        0.5.0     2018-06-12 [1] CRAN (R 3.5.1)
##  pillar         1.3.1     2018-12-15 [1] CRAN (R 3.5.1)
##  pkgbuild       1.0.2     2018-10-16 [1] CRAN (R 3.5.1)
##  pkgconfig      2.0.2     2018-08-16 [1] CRAN (R 3.5.1)
##  pkgload        1.0.2     2018-10-29 [1] CRAN (R 3.5.1)
##  plyr           1.8.4     2016-06-08 [1] CRAN (R 3.5.1)
##  prettyunits    1.0.2     2015-07-13 [1] CRAN (R 3.5.1)
##  processx       3.2.1     2018-12-05 [1] CRAN (R 3.5.1)
##  ps             1.3.0     2018-12-21 [1] CRAN (R 3.5.1)
##  purrr          0.2.5     2018-05-29 [1] CRAN (R 3.5.1)
##  R6             2.3.0     2018-10-04 [1] CRAN (R 3.5.1)
##  RColorBrewer   1.1-2     2014-12-07 [1] CRAN (R 3.5.1)
##  Rcpp           1.0.0     2018-11-07 [1] CRAN (R 3.5.1)
##  readr          1.3.1     2018-12-21 [1] CRAN (R 3.5.1)
##  remotes        2.0.2     2018-10-30 [1] CRAN (R 3.5.1)
##  rgexf          0.15.3    2015-03-24 [1] CRAN (R 3.5.1)
##  rlang          0.3.0.1   2018-10-25 [1] CRAN (R 3.5.1)
##  rmarkdown      1.11      2018-12-08 [1] CRAN (R 3.5.1)
##  Rook           1.1-1     2014-10-20 [1] CRAN (R 3.5.1)
##  rprojroot      1.3-2     2018-01-03 [1] CRAN (R 3.5.1)
##  rstudioapi     0.8       2018-10-02 [1] CRAN (R 3.5.1)
##  scales         1.0.0     2018-08-09 [1] CRAN (R 3.5.1)
##  sessioninfo    1.1.1     2018-11-05 [1] CRAN (R 3.5.1)
##  stringi        1.2.4     2018-07-20 [1] CRAN (R 3.5.1)
##  stringr        1.3.1     2018-05-10 [1] CRAN (R 3.5.1)
##  testthat       2.0.1     2018-10-13 [1] CRAN (R 3.5.1)
##  tibble         1.4.2     2018-01-22 [1] CRAN (R 3.5.1)
##  tidyr          0.8.2     2018-10-28 [1] CRAN (R 3.5.1)
##  tidyselect     0.2.5     2018-10-11 [1] CRAN (R 3.5.1)
##  usethis        1.4.0     2018-08-14 [1] CRAN (R 3.5.1)
##  viridis        0.5.1     2018-03-29 [1] CRAN (R 3.5.1)
##  viridisLite    0.3.0     2018-02-01 [1] CRAN (R 3.5.1)
##  visNetwork     2.0.5     2018-12-05 [1] CRAN (R 3.5.1)
##  withr          2.1.2     2018-03-15 [1] CRAN (R 3.5.1)
##  xfun           0.4       2018-10-23 [1] CRAN (R 3.5.1)
##  XML            3.98-1.16 2018-08-19 [1] CRAN (R 3.5.1)
##  yaml           2.2.0     2018-07-25 [1] CRAN (R 3.5.1)
## 
## [1] /home/ubuntu/R/x86_64-pc-linux-gnu-library/3.5
## [2] /usr/local/lib/R/site-library
## [3] /usr/lib/R/site-library
## [4] /usr/lib/R/library